ABSTRACT

Student evaluations of teacher effectiveness have
been accepted by instructors as helpful indicators of performance,
but their validity and use in tenure and promotion decisions have been
questioned by faculty. Students and instructors in 207 social science
courses completed evaluations of instructional effectiveness at the
conclusion of the semester. Each of the 65 participating faculty
members designated the courses in which his or her teaching had been
the most and the least effective. The instructors then evaluated their
teaching in both courses. Instructor and student evaluations
contained identical items, samples of which are appended. Faculty and
students agreed upon six factors of teacher effectiveness: breadth of
coverage, organization, group interaction, individual interaction,
instructor enthusiasm, and learning/value. Factor analysis revealed
that student and faculty agreement on evaluation factors was high.
Student evaluations of the courses designated most effective by
instructors were higher on all scores. The median evaluation was the
same for both groups. The study indicated that self-evaluation is
beneficial to faculty, and that student evaluation of teaching
effectiveness is a valid process worthy of faculty confidence.
(Author/JAG)
Abstract

Faculty who taught two courses during the spring 1976 semester evaluated
their own teaching in each of the two courses as well as being evaluated by
their students. These faculty felt that quality of teaching should be given
more importance and that students' evaluations were useful for the faculty
themselves, but expressed reservations about both the validity of the student
ratings and their use in tenure/promotion decisions. In spite of these
reservations, there was considerable student-faculty agreement. Separate
factor analyses of the two sets of ratings both supported the six evaluation
factors which had previously been identified, indicating student-faculty
agreement on the dimensions which underlie ratings of effective teaching.
Validity coefficients, correlations between student and faculty ratings on
the same evaluation factors, varied between .33 and .67 (median r = .49).
The differences between mean faculty self-evaluations, averaged across faculty,
and mean students' evaluations were small; the median evaluation was the
same for both groups. Students and faculty agreed upon the teaching behaviors
which were more descriptive and less descriptive of the faculty as a whole
(r = .77). Finally, when faculty indicated that their teaching was "most
effective" in one course and "less effective" in another, students' evaluations
of the "most effective" courses were higher on all evaluation scores. These
findings offer strong new support for the validity of students' evaluations,
suggest the possible usefulness of faculty self-evaluations, and should be
particularly helpful in overcoming faculty resistance to the use of students'
evaluations.



The Validity of Students' Evaluations of 
Instructional Effectiveness: A Comparison 
of Faculty Self-Evaluations and Evaluations 
by Their Students 



Students' evaluations of instructional effectiveness continue to be
both widely used and controversial. Decreasing enrollments and an increased
emphasis on accountability, often externally motivated, have brought about a
renewed interest in evaluating quality of instruction. Consequently, students'
evaluations, often the only measure of teaching effectiveness which is
regularly available, not only affect faculty self-esteem and reputation,
but may also affect their careers. However, demonstrations of the validity of
the students' evaluations are generally limited to specialized settings or
employ criteria which can be easily challenged. Faculty, like other human
beings, are suspicious of the processes used to evaluate them. As long as
there remains broad faculty distrust of the validity of students' evaluations,
their usefulness will be severely limited. The purpose of this study is to
show that faculty self-evaluations of their own teaching, in spite of faculty
reservations about the validity of the student ratings, show good agreement
with students' evaluations of teaching effectiveness.

The most common criticism of students' evaluations, besides the feeling
that they lack validity, is that they are biased by variables unrelated to
teaching effectiveness. There is considerable evidence that most background
variables such as class size, reason for taking the course, workload and
grade point average have little relationship to students' evaluations (Marsh,
1978; Marsh, Overall & Thomas, 1976; McKeachie, 1973; Remmers, 1963).
However, this apparent lack of bias does not necessarily mean that students'
evaluations are also valid measures of instructional quality. Validating
students' ratings is difficult because there are no clearly defined criteria
of instructional quality. Indeed, validating the measurement of any complex
construct like effective teaching requires the use of many alternative criteria.

Validity studies have typically used student performance on standardized
examinations as a criterion. When different instructors teach different
sections of the same course, the sections which evaluate their instructor
most highly are also the sections that perform better on the standardized
examinations given to all sections (Overall & Marsh, 1973; Centra, 1977;
Marsh, Fleiner & Thomas, 1975; Frey, 1973; Cohen & Berger, 1970; Morsh,
Burgess & Smith, 1955), though Rodin and Rodin (1972) did report negative
findings. Overall and Marsh (1978) also showed that the affective consequences
of a course (feelings of course mastery and disposition to pursue the subject
further) were significantly related to students' evaluations. Marsh (1977),
considering an alternative criterion, asked graduating seniors to nominate
"most outstanding" and "least outstanding" faculty in a separate survey which
was mailed to them. Classroom evaluations of the "most outstanding" faculty
were consistently much more favorable than the evaluations of the "least
outstanding" faculty teaching similar classes. This implies that the qualities
of an instructor which cause him to be nominated by graduating seniors are also
reflected in classroom evaluations.

Obvious criteria for validating students' evaluations are the
corresponding evaluations made by the faculty themselves (self-evaluations) or
the evaluations made by other faculty (colleague evaluations). Morsh, Burgess
and Smith (1955) collected evaluations of instruction in an Air Force training
program from students, colleagues, and supervisors in addition to measuring
actual student learning on a standardized examination. Although there was
a strong positive relationship between students' evaluations and actual
learning (r = .49), there was no significant relationship between actual
learning and either colleague or supervisor evaluations. Guthrie (1954), Maslow
and Zimmerman (1956) and Blackburn and Clark (1975) have all reported
[...] to moderate correlations between colleague ratings and student ratings
(correlations ranging from .43 to .69 on ratings of overall teacher
effectiveness). However, none of the studies actually required colleagues to
base their evaluations on visits to the classroom, and it is likely that at
least part of the basis of the colleague evaluations was feedback from
students.

Centra (1975) compared colleague and student evaluations in a setting
which reduced the probable confounding of the two sources of information.
Colleague evaluations were based upon actual classroom visitation, and the
study was conducted in a university at which teaching reputations had not
yet been established. Each of 54 faculty was evaluated twice by each of
three different colleagues. While there was good agreement between the
evaluations of the same colleague on different visits (r = .78), there was
little agreement between the evaluations of different colleagues (r = .26).
The lack of reliability in the colleague ratings precluded any good
correspondence with student evaluations, and the median correlation between the
two groups for 16 evaluation items was only .20.

Blackburn and Clark (1975) reported a correlation of only .19 between
faculty self-evaluations and student evaluations. However, the faculty
self-evaluations were only general impressions of teaching effectiveness which
were not tied to actual performance in a particular course, while the student
evaluations were based upon actual teaching. Centra (1972) asked faculty to
select a single course in which to evaluate themselves and to be evaluated
by their students. Faculty self-evaluations of their selected course tended
to be more favorable than the evaluations by their students and showed only
modest correlations with student ratings; the median correlation for the 17
evaluation items was r = .21. This indicates that faculty who saw
themselves as most effective were also evaluated somewhat more favorably by
their students. However, items which received consistently high or low ratings
by all faculty were also given similar ratings by all students. The correlation
between faculty mean responses to the evaluation items and student mean
responses to the same items was .77. While faculty and students showed only
modest agreement on which faculty were most effective teachers, there was good
agreement on the behaviors at which faculty as a whole are best and worst.

Previous research has found only modest correspondence between faculty
self-evaluations and student evaluations. However, faculty evaluating their
own teaching in a general sense, or even in one specific course, are not
forced to differentiate between their own more and less effective teaching.
Students, on the other hand, have a wide basis of comparison against which
to evaluate the performance of any given instructor. In the present study,
faculty who had just completed teaching two different courses were asked to
indicate in which course their teaching was more effective and in which less
and to evaluate their teaching in both courses. This procedure assured that
faculty self-evaluations were based upon teaching in a specific course and
forced faculty to differentiate between their own more and less effective
teaching.






Faculty Self-Evaluations 5

METHOD 

Students' Evaluations and Survey Instrument

During the Spring 1976 semester at the University of Southern California,
students' evaluations were collected in a total of 207 undergraduate courses
which were taught by faculty in the Division of Social Sciences. Graduate
level courses and courses taught by Teaching Assistants were not included in
these analyses. Student evaluation instruments were sent to faculty in charge
of all courses several weeks before the end of the semester and were actually
used in virtually all of these courses. The evaluation forms were administered
during a class period prior to the final examination, collected by a student
in the class, and immediately taken to the department office. An average of
78% of the students enrolled in these courses completed the survey forms.

The evaluation instrument consisted of 24 evaluation items adapted from
Hildebrand, Wilson and Dienst (1971) and several additional
background/demographic variables. The reliability of individual evaluation
items (Marsh, 1976b), based upon sets of responses from 20 students in each
class, varied between .73 and .90 (median .84). Coefficient alphas, determined
for both students' evaluations and faculty self-evaluations as part of this
study, were computed according to Method 2 of the Statistical Package for
Social Sciences (Nie et al., 1977).

Both the students' evaluations and the faculty self-evaluations were
summarized by eight evaluation scores: factor scores representing the six
evaluation factors and overall ratings of the teacher and the course.
Evaluation factor scores were weighted averages of standardized items. The
weights, factor score coefficients, were derived from a factor analysis (Nie et
al., 1975) done across all courses which were evaluated during a two-semester
period of time (Marsh, 1976a). The evaluation scores are characterized as
follows:

BREADTH OF COVERAGE: Presents a broad background encompassing alternative
approaches to the subject and emphasizing analytic ability and conceptual
understanding.

ORGANIZATION: Is well organized and prepared, giving explanations and
answers which are clear.

GROUP INTERACTION: Encourages class discussions and invites students to
share their own ideas or be critical of those presented by the instructor.

INDIVIDUAL INTERACTION: Is friendly and interested in students and is
accessible to them.

INSTRUCTOR ENTHUSIASM: Displays enthusiasm, energy, humor and ability to
hold student interest.

LEARNING/VALUE: The extent to which students experienced a valuable
learning experience which was intellectually demanding.

OVERALL INSTRUCTOR: A single item asking "How does this instructor compare
with other instructors you have had at this school?"

OVERALL COURSE: A single item asking "How does this course compare with
other courses you have had at this school?"
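
As a rough illustration of how factor scores of this kind can be formed, the
following Python sketch standardizes each item across courses and then
combines the standardized items using a set of weights. The item names,
ratings, and weights are hypothetical and are not the factor score
coefficients used in this study, which came from the factor analysis cited
above.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw item ratings to z-scores (mean 0, standard deviation 1)."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def factor_scores(item_ratings, coefficients):
    """Return one weighted combination of standardized items per course.

    item_ratings: dict mapping item name -> list of ratings, one per course
    coefficients: dict mapping item name -> weight (hypothetical values here)
    """
    z = {item: standardize(vals) for item, vals in item_ratings.items()}
    n_courses = len(next(iter(item_ratings.values())))
    return [
        sum(coefficients[item] * z[item][i] for item in item_ratings)
        for i in range(n_courses)
    ]


# Hypothetical class-average ratings on two items for three courses.
ratings = {
    "well_prepared": [4.0, 3.0, 5.0],
    "clear_explanations": [4.5, 3.5, 4.0],
}
# Hypothetical weights standing in for factor score coefficients.
weights = {"well_prepared": 0.6, "clear_explanations": 0.4}

scores = factor_scores(ratings, weights)
```

Because each item is standardized before weighting, the resulting scores are
centered on zero, so a positive score marks a course rated above the average
of the courses in the sample on that factor.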



Faculty Self-Evaluations and Survey Instrument 

During the 1976 Spring Semester, 65 different instructors in the
Division of Social Sciences taught at least two courses in which they were
also evaluated by their students. A Faculty Self-Evaluation survey was
sent to these teachers at the end of the term, but before summaries of the
students' evaluations had been returned. Faculty were assured that their
responses would remain confidential. Instructors indicated in which course
their teaching was "most effective" and in which "less effective" and rated
the difference in effectiveness. Faculty then evaluated both courses with
a set of items which were identical to those used by students except that
they were worded in the first person. Faculty were asked to rate their own
teaching effectiveness and not to report how students would rate them. The
survey form also contained additional items related to their attitudes towards
students' evaluations and selected background/demographic variables.

A total of 51 (78%) surveys, including evaluations of 83 different
undergraduate courses, were returned in response to the original survey
and two additional mailings to non-respondents. Thirty-two of the
respondents evaluated two undergraduate courses which they designated to be
"most effective" or "less effective". The remaining 19 respondents either
evaluated one undergraduate course and one graduate level course or
evaluated only one undergraduate level course.

RESULTS

Factor Analysis

Factor analysis is used to describe the underlying dimensions which are
actually being measured by a set of questions. The technique identifies
clusters of items which are highly related to each other and less related to
other clusters of items. The simple structure criterion for factor analysis
attempts to determine dimensions so that any given item loads high on one
dimension and low on all others. Then the underlying dimension is "named"
by characterizing the items which load highest on it. Typical uses of factor
analysis are exploration of the pattern of relationships which exist between
different variables, confirmation of hypothesized relationships which exist
between different variables, and the construction of scales which have
greater reliability and generality than the individual items. In this study
the factor pattern underlying students' evaluations had already been
determined (Marsh, 1976a), and a set of 24 evaluation items designed to measure
six evaluation factors had been developed. However, faculty self-evaluations
had not been previously factor analyzed. The purpose of this analysis was to
determine if the factors underlying the students' evaluations were replicable
and if they were similar to those underlying faculty self-evaluations of their
own teaching.

The set of 24 evaluation items and the factors which they were designed
to define are presented in Table One. The factor loadings of the students'
evaluations and the faculty self-evaluations both offered support for these six
evaluation factors; items loaded higher on the factors they were designed to
measure than on other factors. There was disagreement on only two items:
student responses to the item "answers questions carefully" put the
item in the ORGANIZATION factor, while the faculty responses placed it in
the LEARNING/VALUE factor; student responses to the item "discusses
recent developments" placed the item in the BREADTH factor, while faculty
responses placed it in the ORGANIZATION factor.
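
The comparison of item placement can be sketched in Python as follows. Each
item is assigned to the factor on which it loads highest, separately for each
group of raters, and the items whose assignments differ are listed. The
loadings shown are invented for illustration (the actual values are in Table
One), and assigning an item by its largest absolute loading is an assumption
about how "placement" might be operationalized.

```python
def assign_items(loadings):
    """Map each item to the factor on which its absolute loading is largest."""
    return {
        item: max(by_factor, key=lambda f: abs(by_factor[f]))
        for item, by_factor in loadings.items()
    }


def disagreements(student_loadings, faculty_loadings):
    """List the items that the two groups place in different factors."""
    s = assign_items(student_loadings)
    f = assign_items(faculty_loadings)
    return sorted(item for item in s if s[item] != f[item])


# Hypothetical loadings on two of the six factors for two items.
student = {
    "answers questions carefully": {"ORGANIZATION": 0.55, "LEARNING/VALUE": 0.30},
    "is well prepared": {"ORGANIZATION": 0.70, "LEARNING/VALUE": 0.10},
}
faculty = {
    "answers questions carefully": {"ORGANIZATION": 0.25, "LEARNING/VALUE": 0.50},
    "is well prepared": {"ORGANIZATION": 0.65, "LEARNING/VALUE": 0.15},
}

mismatched = disagreements(student, faculty)
```

With these made-up numbers only "answers questions carefully" is placed
differently by the two groups, mirroring the kind of disagreement described
above.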



INSERT TABLE ONE ABOUT HERE

In summary, factor analyses of both the students' evaluations and
the faculty self-evaluations supported the existence of the same six
evaluation factors the items were designed to measure. The similarity in the
factor patterns implies that the two groups agree upon the dimensions which
underlie evaluations of effective teaching. This analysis does not show that
the actual ratings which students give a teacher will agree with those which
the teacher gives himself. In the next section the student evaluation factors,
scores representing the six factors based upon the student ratings, are
correlated with the corresponding faculty self-evaluation factors.

Convergent-Divergent Validity

Campbell and Fiske (1959) advocate the assessment of validity by
determining measures of more than one trait, each of which is assessed
by more than one method. In the present application, the multiple traits
are the six evaluation factors while the multiple methods refer to the two
distinct groups of raters: the students and the faculty. Convergent
validity, that which is most typically determined, is the correlation
between the same evaluation factor rated by the two different groups.
Discriminant (divergent) validity refers to the distinctiveness of each of the
evaluation factors.

Convergent and divergent validity were determined by examining the
set of correlation matrices in Table Two. The two triangular matrices
contain the correlations between different evaluation factors as assessed
by the same group of raters: intercorrelations between student evaluation
factors (upper left) and faculty self-evaluation factors (lower right).
The diagonals of these triangular matrices contain the reliabilities of
the factors for each group of raters. The square matrix (lower left)
contains the correlations between student evaluation factors and faculty
self-evaluation factors. The diagonal of the square matrix, the convergent
validity coefficients, are the correlations between the same evaluation
factors assessed by the two groups of raters. Since there was substantial
unreliability in many of the faculty self-evaluation factors, the convergent
validity coefficients have been corrected for unreliability (Nunnally, 1967).
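
The text does not spell out the correction, so the sketch below assumes the
standard correction for attenuation from classical test theory, in which the
observed correlation is divided by the square root of the product of the two
reliabilities. The input values are the medians reported elsewhere in this
paper (uncorrected r = .39, student reliability .90, faculty reliability
.70); combining medians this way is only an illustration, since each factor
was corrected with its own reliabilities.

```python
from math import sqrt


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Divide an observed correlation by the geometric mean of two reliabilities.

    r_xy:  observed correlation between the two sets of ratings
    rel_x: reliability of the first measure (e.g. student factor score)
    rel_y: reliability of the second measure (e.g. faculty factor score)
    """
    return r_xy / sqrt(rel_x * rel_y)


# Median values taken from the text, used together purely for illustration.
corrected = correct_for_attenuation(0.39, 0.90, 0.70)
```

Under these assumptions the corrected value comes out near .49, which is in
line with the median corrected validity coefficient reported in the text.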

Convergent validity requires that the diagonal values of the square
matrix be substantially higher than zero. Inspection of Table Two shows that
this was the case for all evaluation factors; there was substantial agreement
between students and faculty. These validity coefficients, corrected for
unreliability, varied between .33 and .67 (median r = .49). These findings
offered good support for convergent validity of students' evaluations.

INSERT TABLE TWO ABOUT HERE



Divergent validity was much harder to assess, and Campbell and Fiske
(1959) offered only general guidelines. The minimal condition is that all
correlations between different factors rated by the same group (off-diagonal
correlations in the triangular matrices) must be substantially lower than
the reliabilities of these factors. For example, even though the correlation
between student ratings of ENTHUSIASM and GROUP INTERACTION was .54, this
correlation was still much lower than either of the reliabilities of these
two factors (.92 and .93 respectively). This first condition was clearly
met for both student and faculty ratings. A second condition is that each
convergent validity coefficient must be higher than any other correlation in
the same row or column of the square matrix. This condition was also met
in all cases. A third condition is that a similar pattern of correlations
exist in each of the triangular matrices and the square matrix. This was
generally the case, particularly if only the correlations which were
statistically significant are considered. A final condition, the most
stringent, suggests that each convergent validity coefficient should be higher
than correlations between that factor and any other factor assessed by the same
group of raters. This condition implicitly assumes that the evaluation
factors are truly uncorrelated, clearly not the case for teaching evaluation
factors, and so is only somewhat relevant. This condition was met for the
faculty self-evaluations, but was only partially met by the students'
evaluations. This suggests that there may be some "halo effect" in the
students' evaluations.
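
The first, second, and final conditions lend themselves to a mechanical
check, sketched below in Python. The third condition (similar patterns across
matrices) is a matter of judgment and is left out. Only the student values
for ENTHUSIASM and GROUP INTERACTION (.54, .92, .93) come from the text; the
faculty reliabilities, the faculty intercorrelation, and all student-faculty
correlations are hypothetical placeholders, and the data layout is an
assumption made for the example.

```python
from itertools import combinations


def pair(t, u):
    """Order-independent key for a pair of traits."""
    return tuple(sorted((t, u)))


def check_mtmm(traits, within, reliab, cross):
    """Check three Campbell-Fiske style conditions.

    within[g][pair(t, u)]: correlation between traits t and u rated by group g
    reliab[g][t]:          reliability of trait t for group g
    cross[(t, u)]:         correlation between the students' rating of t
                           and the faculty's rating of u
    """
    results = {}
    # Condition 1: same-group correlations lie below both reliabilities.
    results["below_reliability"] = all(
        within[g][pair(t, u)] < min(reliab[g][t], reliab[g][u])
        for g in within
        for t, u in combinations(traits, 2)
    )
    # Condition 2: each validity coefficient exceeds the other entries in
    # its row and column of the square (student x faculty) matrix.
    results["diagonal_dominates_square"] = all(
        cross[(t, t)] > max(cross[(t, u)], cross[(u, t)])
        for t in traits for u in traits if u != t
    )
    # Final condition: each validity coefficient exceeds the same-group
    # correlations involving that trait, reported separately per group.
    results["diagonal_exceeds_within"] = {
        g: all(
            cross[(t, t)] > within[g][pair(t, u)]
            for t in traits for u in traits if u != t
        )
        for g in within
    }
    return results


E, G = "ENTHUSIASM", "GROUP INTERACTION"
traits = [E, G]
within = {
    "students": {pair(E, G): 0.54},   # from the text
    "faculty": {pair(E, G): 0.30},    # hypothetical
}
reliab = {
    "students": {E: 0.92, G: 0.93},   # from the text
    "faculty": {E: 0.70, G: 0.75},    # hypothetical
}
cross = {                             # all hypothetical
    (E, E): 0.50, (G, G): 0.60,
    (E, G): 0.20, (G, E): 0.25,
}

result = check_mtmm(traits, within, reliab, cross)
```

With these placeholder numbers the final condition holds for the faculty
ratings but fails for the student ratings, which is the same qualitative
pattern the text attributes to a possible halo effect.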

In summary, there was very good support for the convergent validity
of the teacher evaluations, and reasonably good support for their divergent
validity. Student-faculty agreement on the same evaluation factors was
quite high; validity coefficients corrected for unreliability varied between
.33 and .67 (median r = .49). Furthermore, the agreement was specific to
student and faculty ratings on the same evaluation factor. For example,
students' evaluations of ORGANIZATION correlated highly with faculty
self-evaluations of ORGANIZATION, but did not correlate substantially with
any other factors. Reliabilities of the student evaluation factors were
high (median r = .90), but were lower for faculty self-evaluation factors
(median r = .70). Correlations between different student evaluation factors
were somewhat higher than is desirable (median r = .39), perhaps indicating
some halo effect, but were substantially less than the reliabilities of the
factors.

Student-Faculty Agreement--Absolute and Relative

Results in the last section indicated that the correlations between
student evaluation factors and faculty self-evaluation factors were quite
high. However, this does not imply there was absolute agreement since
correlations can only assess relative agreement. For example, if each
instructor always rated himself exactly one category higher than did his
students, there would be perfect relative agreement (a correlation of 1.0)
even though there would not be absolute agreement. The purpose of the analysis
in this section is to test both relative and absolute agreement on individual
evaluation items.
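
The distinction can be made concrete with a small Python sketch. The ratings
are hypothetical: each instructor's self-rating is set exactly one category
above the rating given by his students, so the correlation is perfect while
the means differ by a full category.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical student ratings of five instructors on one item.
student = [2, 3, 3, 4, 2]
# Each instructor rates himself exactly one category higher.
faculty = [s + 1 for s in student]

r = pearson(student, faculty)          # relative agreement
gap = mean(faculty) - mean(student)    # absolute disagreement
```

Here `r` is 1.0 up to floating-point rounding, while `gap` is one full
category, which is why the analysis in this section examines mean differences
as well as correlations.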

Mean faculty self-evaluations were very similar to the mean of students'
evaluations (see Table Three). For the 24 evaluation items, the median
rating was exactly the same for both groups: 4.07 on the five-point response
scale. Differences between faculty and student ratings reached statistical
significance on only five items; students' evaluations were higher on two
and lower on three.

The mean faculty self-evaluation on each of the 24 evaluation items,
averaged across all faculty responses, correlated quite highly (r = .77)
with the mean student ratings. This implies that students and faculty agree
upon what the faculty as a whole does best and worst. For example, both
faculty and students rate faculty as most effective at being enthusiastic,
being friendly to students, enjoying teaching, being well prepared, and
having an interest in students, but perceive faculty to be least effective
at giving lectures which are easy to outline, knowing when students are
bored and confused, and enhancing presentations with the effective use of humor.

Correlations between faculty self-evaluations and ratings by their
students are presented in Table Three. These correlations are similar in
meaning to those presented in Table Two, but show agreement on individual
items rather than factors. Correlations were significantly positive on 23
of 24 evaluation items, the median correlation being r = .30. It is
interesting to note that this is lower than the median correlation for factor
scores, even before they were corrected for unreliability (median uncorrected
r = .39). The higher correspondence between the evaluation factors was
primarily due to the greater reliability of the factors, compared to the
reliability of individual items.

In summary, these findings indicate that there was good agreement, both
absolute and relative, between students' evaluations and the
corresponding evaluations by their teachers. Differences between mean student
ratings and mean faculty self-evaluations were small, correlations between
ratings by the two groups were statistically significant on 23 of 24 items,
and there was good agreement between the two groups about what teaching
behaviors were more descriptive and less descriptive of the faculty as a
whole.

Differentiation Between "Most" and "Less" Effective Courses 

Faculty in this study, unlike those in the other studies discussed, selected
one course in which their teaching was "most effective" and another course
in which their teaching was "less effective". Many potential problems
inherent in the use of category ratings were avoided with this procedure.
While category ratings are also the basis of students' evaluations, each
student is exposed to a variety of different teachers, and the evaluation
of each teacher is based upon the responses of many different students. On
the other hand, faculty self-evaluations are based upon the response of a
single individual who may not have taken an undergraduate course for a decade
or more. The self-rating of "4" by one instructor may or may not be
different in meaning from the "3" by another instructor. However, if the same
instructor gives himself a "4" in one class and a "3" in another, it is clear
that he feels that there is a difference between the two classes. This
methodology also forces faculty to evaluate critically the difference in their
teaching effectiveness in the two courses.

This procedure does present a much more demanding test of the students'
evaluations. While the difference between a very good teacher and a very poor
one may be readily apparent, the difference in teaching effectiveness of the
same instructor in two different classes is much more subtle. Furthermore,
virtually all of the faculty taught only two courses during the semester, so
there was a limited range from which to select a "most effective" and "less
effective" course. In fact, a number of faculty (22%) indicated there was
little or no difference in their effectiveness in the two courses even
though they did indicate one as "most effective."

Students' evaluations significantly differentiated between the "most
effective" and "less effective" courses (see Table Four). Differences on
each of the eight evaluation scores separately and the multivariate
difference based upon the entire set of evaluation scores were all statistically
significant. The largest differences were for the Overall Instructor rating
and the Instructor Enthusiasm factor. Faculty self-evaluations also
differentiated between the two groups of courses, but differences were not
statistically significant for all of the evaluation scores. The "most
effective" and "less effective" courses were also compared on 10
background/demographic variables which describe the instructor, the student,
and the course. Differences between the two groups of courses failed to reach
statistical significance on nine of the ten variables (see Table Five), as did
the multivariate difference based upon all ten variables.



INSERT TABLES FOUR AND FIVE ABOUT HERE 



In summary, faculty who had taught two courses during the same semester
were asked to indicate the courses in which their teaching had been "most
effective" and "less effective". Students' evaluations of the "most effective"
courses were significantly higher for all evaluation scores, even though the
two groups of courses were similar on 10 background/demographic variables.
Inspection of Table Four indicated that the students' evaluations actually
showed better differentiation than did faculty self-evaluations of their own
teaching.

DISCUSSION

Faculty who taught two undergraduate courses during the same semester
evaluated their own teaching in each course as well as being evaluated by
their students. Both faculty and students used essentially the same
evaluation forms. In spite of faculty skepticism concerning the validity of
students' evaluations, there was very good student-faculty agreement.
Separate factor analyses of the two sets of ratings indicated student-faculty
agreement on the evaluation dimensions which underlie ratings of teaching
effectiveness. Validity coefficients, correlations between student ratings
and faculty ratings on the same evaluation factors, were all highly significant
(median r = .49). The reliabilities of the students' evaluations were high
(median r = .90) though faculty self-evaluations were somewhat less reliable
(median r = .70). Mean faculty self-evaluations, averaged across all faculty
responses, were generally similar to the mean students' evaluations; the median
rating for the 24 evaluation items was 4.07 for both groups. Furthermore,
there was faculty-student agreement upon which teaching behaviors were more
descriptive and less descriptive of the faculty as a whole (r = .77). Finally,
when faculty indicated that their teaching was relatively more effective in
one of the two courses in which they evaluated themselves, students' evaluations
of the "most effective" courses were significantly higher for each of the
evaluation factors and both the overall summary ratings. In fact, students'
evaluations better differentiated between courses in which faculty indicated
that their teaching was "most effective" and "less effective" than did the
faculty self-evaluations of their own teaching.




Previous research (Centra, 1972; Blackburn & Clark, 1975) reported lower
validity coefficients than were found in this study and also reported that
faculty self-evaluations were consistently higher than were students'
evaluations. The differences in their findings may well be due to the lower
reliability of the measurement instruments which they used and the different
methodologies which they employed. Blackburn and Clark (1975), reporting
the lowest level of agreement, had each teacher rate himself on a single
global item of "overall teaching effectiveness" which was not actually tied
to performance in a particular course. Reliability data were not presented
for either the students' evaluations, also assessed with a single item, or
the faculty self-evaluations. Centra (1972) asked faculty to select one
course in which they would evaluate themselves and be evaluated by their students
at the middle of the semester. The effect of the timing of the evaluations,
coming at the middle of the semester, is not known, but this is an obvious
difference in methodology. Furthermore, many faculty in that study taught
more than one course and probably selected the course in which they felt their
teaching was most effective. If this selection bias did exist, it would
produce more favorable faculty self-evaluations, it would limit the range of
teaching effectiveness which was actually observed (lowering the correlations
between the two sets of evaluations), and it would not force faculty to
differentiate between their own more effective and less effective teaching.
Although Centra did include items which apparently tapped different evaluation
factors, validity coefficients were based upon individual evaluation items
rather than evaluation factors. Reliabilities of Centra's items (Centra, 1973)
were somewhat lower than reliabilities of individual items in the present
study and were markedly lower than the reliabilities of the evaluation
factors. Centra presented no data on the reliabilities of the faculty
self-evaluations.

Faculty in this study, as is generally the case, were somewhat skeptical
about the validity of students' evaluations and their use. Faculty did indicate
that some measure of teaching effectiveness should be given more weight in tenure/
promotion decisions and did indicate that the students' evaluations were useful
to the faculty themselves. (78% agreed with the statement that "students'
evaluations can provide instructor with information which is useful for the improvement
of the course and/or quality of teaching", while only 2% disagreed.) However, faculty
were skeptical about the validity of students' evaluations. (Only 28% agreed with
the statement that "students' evaluations represent accurate assessments of
instructional quality".) Similar reservations were expressed about the use of students'
evaluations in tenure/promotion decisions. Furthermore, faculty were equally skeptical
about other possible measures of teaching effectiveness which were suggested and
did not provide any alternative measures which met with their approval. A dilemma
clearly exists. Faculty are concerned about teaching effectiveness, even to the
extent of wanting it to play a major role in administrative decisions, but have no
confidence in any measures of teaching effectiveness, including students' evaluations.
Before the potential usefulness of students' evaluations can be achieved, faculty
and administrators (who are generally faculty, former faculty, or at least
"faculty-like" people) have to be willing to trust students' evaluations.

An important role of research in students' evaluations, besides demonstrating
their reliability, validity and lack of bias, is to convince faculty and
administrators of their worth. No matter how good a measure actually is, it is of little
value unless it is used. Previous research has clearly demonstrated the validity of
students' evaluations against a multitude of criteria, yet faculty are still skeptical.
Any particular validity criterion is generally either quite specific to a particular
course (e.g., standardized performance in a multi-section course of calculus or computer
programming) or can be attacked as being inappropriate (e.g., alumni ratings). In the
present study, the criterion used was faculty self-evaluations of their own teaching.
In spite of the faculty reservations about the accuracy of the students'
evaluations, the results showed good student-faculty agreement. Not only
does this study provide important new evidence for the validity of students'
evaluations, but the findings should be instrumental in overcoming faculty
reservations about the students' evaluations.

Faculty self-evaluations in this study have been used primarily as a
criterion for validating students' evaluations. However, the findings do
suggest that under some circumstances the faculty self-evaluations can
be useful as well. Factor analysis of the faculty self-evaluations gave
evidence of a well-defined factor structure. The faculty self-evaluations also
showed good agreement with students' evaluations. Even the moderate lack
of reliability of the faculty self-evaluations might be overcome if data
were averaged across several different courses. While it is probably
unrealistic to expect faculty to be objective if their self-evaluations were to
be used for tenure/promotion decisions, their ratings may prove quite
valuable to the improvement of teaching. The thought processes necessary
to complete the self-evaluations require that faculty carefully scrutinize
their teaching. Furthermore, Centra (1972) reported that faculty who
found that their students evaluated them much lower than they had evaluated
themselves were more likely to benefit from the feedback provided by students'
evaluations.
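
The suggestion that averaging across several courses might offset the lower
reliability of faculty self-evaluations can be illustrated with the
Spearman-Brown formula. The sketch below is not part of the original analysis.
It assumes that self-evaluations from different courses behave as parallel
measures, and it takes the median faculty reliability of .70 reported in this
study as the single-course value. The function name `spearman_brown` is
hypothetical.

```python
def spearman_brown(reliability: float, k: int) -> float:
    """Projected reliability of the average of k parallel measures."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * reliability / (1.0 + (k - 1) * reliability)


# Median reliability of faculty self-evaluations reported in the text.
single_course = 0.70

# Projected reliability when self-evaluations are averaged over k courses.
for k in (1, 2, 3, 4):
    print(k, round(spearman_brown(single_course, k), 3))
```

Under these assumptions, averaging self-evaluations over three courses would
raise the projected reliability from .70 to about .875, which approaches the
level reported for the students' evaluations.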
TABLE ONE

Factor Analyses of Students' Evaluations and Faculty Self-Evaluations

(N=83 Courses)

                                            I    II    III    IV    V    VI

I. BREADTH OF COVERAGE

     Discusses other points of view
     Contrasts implications
     Discusses recent developments
     Presents origins of ideas/concepts

II. ORGANIZATION

     Explains clearly
     Is well prepared
     Lectures easily outlined
     Answers questions carefully

III. GROUP INTERACTIONS

     Encourages class discussions
     Invites sharing of knowledge/ideas
     Invites criticism of own ideas
     Knows when students confused/bored

IV. INDIVIDUAL RAPPORT

     Has interest in students
     Is friendly to students
     Is accessible out of class

V. INSTRUCTOR ENTHUSIASM

     Is dynamic and energetic
     Has interesting presentation style
     Is enthusiastic about subject
     Enhances presentation with humor
     OVERALL COURSE RATING

VI. LEARNING

     Course intellectually demanding
     You learned something valuable
     OVERALL COURSE RATING



71 


(59) 


"10 


86 » 


(86) 


14 


44 


(23) 


-23 


51 


(62) 


10 



06 
25 
-03 
17 



02 
07 
22 
-0^ 



-02 
08 
18 



14 
05 
10 
16 
16 
11 

11 
06 
12 



(-10) 
(17) 
(21) 
(11) 



(03) 
(06) 
(51) 
(-03) 



(-04) 
(02) 
(04) 
(07) 

(-04) 
(12) 



U3) 
C06) 
(44). 



03 
-07 
09 

22 



(-23) 
(-04) 
(02) 
(12) 



18 
25 
42. 
24 

-07 - 
27 



(08) 
(28) 
C35) 

(17) 
(32) 
(02) 



(-04) (^3 (-12) 
(02) 06 (18) 
(22) 19 (02) 



27 
-02 
10 
13 



73 ( 


[77) 


02 


80 


[51) 


-17 


70 1 


73) 


14 


60 


|04) 


30 



(00) 12 (30) 28 
(-18) 05 (24) <25 
(02) -03 (-07) '-04 



-02 
08 
-17 
-05 
12 
15 

00 
13 
19 



'9 - 
(19) 

(-15) 
(12) 

(-05) 



(-12) 
(-01) 
(-12) 
(18) 



(06) 
(35) 
(00) 



(02) 

(11) 
(04) 
(25) 
(29) 
(05) 

(-18) 
(-15) 
(05) 



13 (-U) 

-07 (-09) 

,10 (16) 

22 (32) 



01 .(-09) 

09 (16 

10 (-04) 
15 (-09) 



88 1 


[85) 


-03 


86 1 


;84) 


15 


65 1 


51) 


29 


40 \ 


l48) 


13 



69 


(44) 


78 


(35) 


67 


(41) 


23 


(10) 


-03 


(-12). 


36 


(40) 


24 


(51) 


06 


(-38) 


23 


(-23) 


12 


(23) 


-04 


(27) 


-07 


(-16) 



0^4 


(-21) 


0 


(23) 


29 


(-14) 


07 


(17) • 


33 


(15) 


-17 


(23) 


14 


(-03) 


08 


(-16) 


02 


(11) 


09 


(-061 


06 


(-14) 


sit. 




20 


(12) 


23 


(06) 


-21 


(-06) 


48 


(80) 


63 


(57) 


30 


(65) 


38 


(49) 


.74 


(29) 


38 


(64) 


'09 


(14) 


20 


(00) 


22 


(44) 



■09 (-05] 
06 (07] 



24 

26 



(24 
(13) 



09 


(12) 


18 


(11) 


08 


(04), 


03 


(75)1 


12 


(-02) 


-02 


(01.) 


01 


(-00) 


21 


(21) 


"-01 


(11) 


-20 


(-02) 


22' 


(14) 


30 


(07)-' 


27 


(24) 


08 


(-14) 


30 


(-b2) 


-05 


(16) 


30 


(19) 


72 


(61) 


68 


(54) 


6i 


(44) 



1-Factor loadings in bold boxes are loadings for items designed to measure each factor.
Faculty self-evaluations are presented in parentheses.



Results of factor analysis 



2-Factor analyses of both sets of data consisted of a principal components analysis, Kaiser normalization, and rotation
to a direct oblimin criterion for which the delta parameter was set at -2.0. Analysis was performed with the commercially
available Statistical Package for the Social Sciences (Nie et al., 1975).



Reliabilities and intercorrelations between the different factors are presented in the next section.
TABLE TWO

CONVERGENT AND DISCRIMINANT VALIDITY
(N=83 courses evaluated by both students and faculty)

                                     Students' Evaluations              Faculty Self-Evaluations
                              (1)   (2)   (3)   (4)   (5)   (6)     (7)   (8)   (9)   (10)  (11)  (12)

STUDENTS' EVALUATIONS
 (1) ENTHUSIASM              (92)
 (2) GROUP INTERACTION        54   (93)
 (3) LEARNING                 24    04   (87)
 (4) INDIVIDUAL RAPPORT       39    46    05   (86)
 (5) BREADTH OF COVERAGE      46    35    45    44   (86)
 (6) ORGANIZATION             61    28    35    39    47   (93)


FACULTY SELF-EVALUATIONS

 (7) ENTHUSIASM

 (8) GROUP INTERACTION
 (9) LEARNING



J" 



[37 (.42)]   16   15   02
00   [47 (.55)]   -17   -01
07   -18   [38 (.50)]



04 



12 
00 
16 



(10) INDIVIDUAL RAPPORT

(11) BREADTH OF COVERAGE

(12) ORGANIZATION



17 
12 
05 
10 



(85) 

22 (79) 



-06 rW '29''' |32(.48){ .07 
'-bz^''--]^'l'-'ZO '. -03 [26 (.33)1 05 



21 
09 
09 



'28 —13. ' 36'< >v.d7 - 22 |55(.6 7) 

i 



17 (67) 
04 21 (51) 
18' 29 
14 



-19 



04 
•24 



,(72) 

-.21 (72) 



1-Values in diagonals of upper left and lower right matrices, the two triangular matrices, are reliability estimates
(coefficient alphas). (See Nie, et al., 1977)

2-Values in diagonals of lower left matrix, square matrix, are validity coefficients. The values in parentheses have been
corrected for unreliability according to the equation: corrected $r_{xy}$ = (uncorrected $r_{xy}$) / $\sqrt{(r_{xx})(r_{yy})}$

3-Correlation coefficients are presented without decimal points; correlations greater than 20 are statistically
significant.
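
The correction for unreliability described in note 2 can be checked with a few
lines of code. The sketch below is an illustration rather than the authors'
own computation: the function name `disattenuate` is hypothetical, and the
validity coefficients and coefficient alphas are entered as best they can be
read from the diagonals of this table.

```python
import math
import statistics


def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Correct a correlation for unreliability in both measures."""
    return r_xy / math.sqrt(r_xx * r_yy)


# (factor, uncorrected validity, student reliability, faculty reliability),
# as read from the diagonals of the table above.
FACTORS = [
    ("Enthusiasm", 0.37, 0.92, 0.85),
    ("Group Interaction", 0.47, 0.93, 0.79),
    ("Learning", 0.38, 0.87, 0.67),
    ("Individual Rapport", 0.32, 0.86, 0.51),
    ("Breadth of Coverage", 0.26, 0.86, 0.72),
    ("Organization", 0.55, 0.93, 0.72),
]

corrected = {
    name: round(disattenuate(r_xy, r_xx, r_yy), 2)
    for name, r_xy, r_xx, r_yy in FACTORS
}

for name, value in corrected.items():
    print(f"{name:<20} {value:.2f}")
print("median corrected validity:", round(statistics.median(corrected.values()), 2))
```

With these inputs the corrected coefficients agree with the parenthesized
values on the diagonal of the square matrix, and their median is .49.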
TABLE THREE

Agreement Between Faculty Self-Evaluations and
Evaluations of Their Students
(N=83 courses evaluated by both faculty and students)

                                            ABSOLUTE AGREEMENT              RELATIVE AGREEMENT
                                        Mean Faculty       Mean Student    Correlation Between
                                        Self-Evaluations   Evaluations     Faculty and Student
                                                                           Evaluations
BREADTH OF COVERAGE
  Discusses other points of view            4.04            4.12 NS            +.19 *
  Contrasts implications                    3.96            4.19 *             +.32 **
  Discusses recent developments             4.30            4.16 NS            +.27 **
  Presents origins of ideas/concepts        4.04            4.11 NS            +.21 *

ORGANIZATION
  Explains clearly                          3.95            3.99 NS            +.49 **
  Is well prepared                          4.36            4.24 NS            +.42 **
  Lectures easily outlined                  3.58            3.69 NS            +.44 **
  Answers questions carefully               4.18            4.01 NS            +.05 NS

GROUP INTERACTION
  Encourages class discussions              4.11            4.07 NS            +.47 **
  Invites sharing of knowledge/ideas        3.76            3.95 NS            +.47 **
  Invites criticism of own ideas            3.94            3.72 NS            +.23 *
  Knows when students confused              3.80            3.54 *             +.19 *

INDIVIDUAL RAPPORT
  Has interest in students                  4.74            4.17 **            +.20 *
  Is friendly to students                   4.52            4.39 NS            +.21 *
  Is accessible out of class                4.06            4.11 NS            +.50 **

INSTRUCTOR ENTHUSIASM
  Is dynamic and energetic                  4.17            3.95 *             +.28 **
  Has interesting presentation style        3.82            3.78 NS            +.35 **
  Enjoys teaching                           4.26            4.49 **            +.46 **
  Is enthusiastic about subject             4.60            4.45 NS            +.26 **
  Enhances presentations with humor         3.64            3.81 NS            +.41 **
  OVERALL COURSE RATING                     4.08            4.07 NS            +.31 **

LEARNING
  Course intellectually demanding           4.23            4.07 NS            +.23 *
  You learned something valuable            4.11            4.23 NS            +.29 *
  OVERALL COURSE RATING                     3.79            3.82 NS            +.32 *

(Median of 24 items)                       (4.07)          (4.07)             (+.30)

* p < .05; ** p < .01; NS-Not Significant


1- Two-tailed statistical tests were used in determining absolute agreement since it
was assumed that students' evaluations may be either higher or lower than faculty
self-evaluations. One-tailed tests were used to test relative agreement since it
was assumed that correlations would only be positive.

2- The correlation between the 24 mean faculty responses and the 24 mean student
responses is .77, indicating good agreement on what teaching behaviors are more or
less descriptive of faculty as a whole.

3- The correlations between faculty and student responses were not corrected for
unreliability, which is substantial for faculty self-evaluations.

4- Evaluation factor scores were not used in this analysis since factor scores, a
weighted average of z-scores, have a mean of 0.0 (or some other arbitrary value).
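
The two summary figures quoted for this table, the median rating of 4.07 and
the correlation of .77 between the 24 pairs of item means, can be recomputed
from the tabled means. The following is a minimal sketch, assuming the item
means are paired in the order in which they are listed in the table; the names
`ITEM_MEANS` and `pearson_r` are hypothetical.

```python
import math
import statistics

# Item means in table order: (faculty self-evaluation, student evaluation).
ITEM_MEANS = [
    (4.04, 4.12), (3.96, 4.19), (4.30, 4.16), (4.04, 4.11),  # Breadth of Coverage
    (3.95, 3.99), (4.36, 4.24), (3.58, 3.69), (4.18, 4.01),  # Organization
    (4.11, 4.07), (3.76, 3.95), (3.94, 3.72), (3.80, 3.54),  # Group Interaction
    (4.74, 4.17), (4.52, 4.39), (4.06, 4.11),                # Individual Rapport
    (4.17, 3.95), (3.82, 3.78), (4.26, 4.49),                # Instructor Enthusiasm
    (4.60, 4.45), (3.64, 3.81), (4.08, 4.07),
    (4.23, 4.07), (4.11, 4.23), (3.79, 3.82),                # Learning
]


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


faculty = [f for f, _ in ITEM_MEANS]
students = [s for _, s in ITEM_MEANS]

faculty_median = statistics.median(faculty)
student_median = statistics.median(students)
r = pearson_r(faculty, students)

print(round(faculty_median, 2), round(student_median, 2), round(r, 2))
```

Because the tabled means are rounded to two decimals, the recomputed
correlation should be read as approximate; it comes out close to the .77 given
in note 2.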



TABLE FOUR

DIFFERENCES IN EVALUATIONS OF COURSES IN WHICH FACULTY INDICATED
THEIR TEACHING WAS "MOST EFFECTIVE" AND "LESS EFFECTIVE"
(N=32 "most effective" and 32 "less effective" courses)

                              Students' Evaluations             Faculty Self-Evaluations
                          Most Effective  Less Effective    Most Effective  Less Effective
Evaluation Factors(1)        Courses         Courses           Courses         Courses

ENTHUSIASM                    105.3           96.9              102.1           98.9 *
GROUP INTERACTION             104.5           99.0 **           102.3           99.0 NS
INDIVIDUAL RAPPORT            103.3           98.7 **           100.7          102.1 NS
BREADTH OF COVERAGE           103.7           66.2 **           100.2           97.2 NS
VALUE/LEARNING                102.6           98.2 *            104.7           ?7.2 **
ORGANIZATION                  103.2           98.9 *            101.6           97.9 *
OVERALL INSTRUCTOR             4.24           3.98 **            4.27           3.97 **
OVERALL COURSE                 3.96           3.74 **            4.11           3.57 **

* p < .05; ** p < .01; NS-Not Significant

1- Evaluation factors, the first six evaluation scores, were standardized (mean=100,
standard deviation=15) for students and faculty separately. The two Overall Summary items
varied along a five-point scale ranging from "1-Among the Worst" to "5-Among the Best".

2- Statistical significance was determined by a one-tailed dependent t-test, since scores
were predicted to be higher in the "most effective" courses on an a priori basis.

3- Multivariate significance tests, Hotelling's T-Squared, indicated significant
differentiation between the two groups of courses with both the students' evaluations (Hotelling
T-Square = 33.2; F(8,24) = 3.3, p<.01) and faculty self-evaluations (Hotelling T-Square
= 31.6; F(8,22) = 2.9, p<.05).

4- Results presented in this table are based upon only faculty who rated themselves and were
rated by their students in two undergraduate courses. A total of 64 courses, 32 pairs,
met this criterion. The remaining 19 courses were either paired with a graduate level
course, or were unpaired (i.e., faculty rated only one course).
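
Notes 1 and 2 describe two computational steps: rescaling factor scores to a
mean of 100 and a standard deviation of 15, and comparing paired courses with
a dependent t-test. The sketch below illustrates both steps under stated
assumptions. The raw scores are hypothetical values for five instructors, not
data from the study; the population standard deviation is used for the
rescaling, which the source does not specify; and the function names
`standardize` and `paired_t` are illustrative.

```python
import math
import statistics


def standardize(scores, mean=100.0, sd=15.0):
    """Rescale scores to the given mean and (population) standard deviation."""
    m = statistics.mean(scores)
    s = statistics.pstdev(scores)
    return [mean + sd * (x - m) / s for x in scores]


def paired_t(most, less):
    """Dependent-samples t statistic and degrees of freedom for paired scores."""
    diffs = [a - b for a, b in zip(most, less)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Hypothetical raw factor scores for five instructors (not data from the study).
most_raw = [0.9, 0.4, 1.1, 0.2, 0.6]
less_raw = [0.1, -0.3, 0.5, -0.4, 0.2]

# Standardize across all courses together, then split back into the pairs.
scaled = standardize(most_raw + less_raw)
most, less = scaled[:5], scaled[5:]

t_stat, df = paired_t(most, less)
print(round(statistics.mean(scaled), 1), round(statistics.pstdev(scaled), 1))
print(round(t_stat, 2), df)
```

The t statistic is unaffected by the rescaling, since the same linear
transformation is applied to both members of every pair. Obtaining the
one-tailed probability would additionally require the t distribution, which is
omitted here.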



TABLE FIVE

BACKGROUND/DEMOGRAPHIC DIFFERENCES BETWEEN "MOST
EFFECTIVE" AND "LESS EFFECTIVE" COURSES

(N=32 "most effective" & "less effective")

                                                       Most Effective   Less Effective
Background/Demographic Variable                           Courses          Courses

Number of Times Instructor Had Taught Same
or Similar Course                                           5.24             4.17        NS

Faculty Impressions of Student Interest in Subject
at Start of the Course (1-Very Low...5-Very High)           3.41             3.17        NS

Instructor's Self-Rating of "Grading Leniency"
(1-Very Easy Grader...5-Very Hard/Strict Grader)            3.57             3.57        NS

Class Average of Students' Expected Grade
(0-F...4-A)                                                 3.33             3.27        NS

Average of Students' GPA (1-Below 2.4, 2-2.4 to 2.9,
3-2.9 to 3.37, 4-3.37 to 3.7, 5-Above 3.7)                  3.46             3.42        NS

Percentage of Upper Division Students in Class              60%              50%         NS

Percentage of Students Majoring in Division                 59%              48%         NS

Percentage of Students Taking Course to "Fulfill
a Major Requirement"                                        51%              46%         NS

Course Level (1-Lower Division, 2-Upper Division)           1.84             1.66        *

** p < .01; * p < .05; NS-Not Significant

1- Statistical significance was tested with two-tailed dependent t-tests since there was
no a priori basis for predicting the direction of the differences.

2- Multivariate significance, Hotelling's T-Square, indicated that, assessing all 10 variables
simultaneously, the difference between the two groups of courses was not statistically
significant (Hotelling T-Square = 17.6; F(10,20) = 1.21, p>.05).
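
The F ratio reported in note 2 can be related to the Hotelling T-Square through
the standard conversion for a one-sample (paired-difference) test with p
variables and n pairs: F = T-Square x (n - p) / (p(n - 1)), with (p, n - p)
degrees of freedom. The sketch below applies that conversion. The value n = 30
is an assumption inferred from the reported degrees of freedom, not a figure
stated in the source, and the function name `hotelling_t2_to_f` is
hypothetical.

```python
def hotelling_t2_to_f(t2: float, n: int, p: int):
    """Convert a one-sample (paired) Hotelling T-square to F with (p, n - p) df."""
    if n <= p:
        raise ValueError("need more paired observations than variables")
    f_value = t2 * (n - p) / (p * (n - 1))
    return f_value, (p, n - p)


# T-square and the number of variables are taken from note 2 above; n = 30
# pairs is an assumption inferred from the reported F(10, 20) degrees of freedom.
f_value, df = hotelling_t2_to_f(t2=17.6, n=30, p=10)
print(round(f_value, 2), df)
```

With these inputs the conversion gives an F of about 1.21 on (10, 20) degrees
of freedom, in line with the reported value.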